Instructions

The following tutorial will let you reproduce the plots that we are going to create at the workshop using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section.

Introduction

This tutorial will focus on analysing the updated data of the worldwide Novel Corona virus (COVID-19) pandemic.
There are several data sources available online. We will use the data collected from a range of sources by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) and hosted on their GitHub repository. We will also use weekly COVID-19 data from the European Centre for Disease Prevention and Control website Note that both data sources are lagging 1-2 days behind our current date!!

Analyse Data in R

To run R and RStudio on Binder, click on this badge - Launch Rstudio Binder.

Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project’s folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.

Install Extra Packages

For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computers) to try and avoid having to download the entire tidyverse. In addition to the Tidyverse packages we’ve got to know in the previous workshop we will use the plotly package to create interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file, scales to format large numbers, lubridate to better handle dates, glue to paste together strings, highcharter to plot the data on a world map and a few others to assist in getting the data into shape.
To install these packages, we will introduce a package called pacman that will assist in loading the required packages and installing them if they’re not already installed. To install it we use the install.packages('pacman') command, please note that the package name need to be quoted and that we only need to perform it once, or when we want or need to update the package. Once the package was installed, we can load its functions using the library(pacman) command and then load/install all the other packages at once with p_load() function.

# install required packages - needed only once! (comment with a # after first use)
install.packages('pacman')
# load required packages
library(pacman)
required_packages <- c("dplyr", "tidyr", "ggplot2", "paletteer",
                       "stringr", "readr","readxl", "glue",
                       "highcharter", "forcats", "scales",  
                       "plotly", "lubridate", "here", "ISOweek",
                       "countrycode")
p_load(char=required_packages)

More information on installing and using R packages can be found in this tutorial.

Read Data

Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.). Please download the data files from Blackboard and put them in the data folder.

We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data from a file on our computer into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R, which will slightly change the structure of the resulting data frame (will convert all text columns into factors and won’t automatically parse columns containing dates).
> Note that the path to the files can be either relative or absolute to our current working directory and that we use a forward slash / (Unix/Mac style) and not a backslash \ (Windows style), to make life easier, make use of RStudio’s auto-completion feature and the here package._

# read data from the 'data' folder
covid_data <- read_csv(here("data/csse_country_covid_data_18_07_2021.csv")) %>% 
  rename(Country_Region=`Country/Region`)

Data Exploration

Let’s use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column:

#explore the data frame
head(covid_data) # show first 10 rows of the data and typr of variables
## # A tibble: 6 x 5
##   Country_Region Date       Confirmed Deaths Recovered
##   <chr>          <date>         <dbl>  <dbl>     <dbl>
## 1 Afghanistan    2020-01-22         0      0         0
## 2 Afghanistan    2020-01-23         0      0         0
## 3 Afghanistan    2020-01-24         0      0         0
## 4 Afghanistan    2020-01-25         0      0         0
## 5 Afghanistan    2020-01-26         0      0         0
## 6 Afghanistan    2020-01-27         0      0         0
str(covid_data) # show data structure
## spec_tbl_df [106,080 x 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country_Region: chr [1:106080] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ Date          : Date[1:106080], format: "2020-01-22" "2020-01-23" ...
##  $ Confirmed     : num [1:106080] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Deaths        : num [1:106080] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Recovered     : num [1:106080] 0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Country/Region` = col_character(),
##   ..   Date = col_date(format = ""),
##   ..   Confirmed = col_double(),
##   ..   Deaths = col_double(),
##   ..   Recovered = col_double()
##   .. )

Descriptive Statistics

We can also produce some descriptive statistics to better understand the data and the nature of each variable. The summary() function (as can be guessed by its name) provides a quick summary of basic descriptive statistics, such as the mean, min, max and quantiles for continuous numerical values.

# summary of variables in my data
summary(covid_data)
##  Country_Region          Date              Confirmed            Deaths      
##  Length:106080      Min.   :2020-01-22   Min.   :       0   Min.   :     0  
##  Class :character   1st Qu.:2020-06-05   1st Qu.:     311   1st Qu.:     3  
##  Mode  :character   Median :2020-10-19   Median :    7835   Median :   127  
##                     Mean   :2020-10-19   Mean   :  332315   Mean   :  7924  
##                     3rd Qu.:2021-03-04   3rd Qu.:  100331   3rd Qu.:  1821  
##                     Max.   :2021-07-18   Max.   :34079960   Max.   :609019  
##    Recovered       
##  Min.   :       0  
##  1st Qu.:      43  
##  Median :    3915  
##  Mean   :  198460  
##  3rd Qu.:   60374  
##  Max.   :30308456

We can see that most of our data contains ‘0’ (check the median of Confirmed column). Just to confirm that, let’s plot a histogram of all the confirmed cases

ggplot(covid_data, aes(x=Confirmed)) +
  geom_histogram(fill="lightskyblue") +
  theme_bw(16)

What are the metadata columns that describe our observations?

Country 
Date

The data is evolving over a time-series, to there’s no point treating it as a random population sample.

Time-series plot

Let’s look at confirmed cases data for the 10 most affected countries (to date). To find out these countries so we need to wrangle our data a little bit using the following steps:

  1. First we group it by Country with group_by()
  2. Then we sort it within each Country by Date (from latest to earliest) with arrange(desc(Date))
  3. We select just the most recent data with slice(1) and remove grouping with ungroup()
  4. Next we arrange it by descending order of confirmed cases and select the top 10
  5. We extract just the Country information as a vector of values using the $ sign
  6. Finally, we subset our original data to contain just the countries from our vector with filter()

Optional step:

  1. We can reorder the countries so they will be ordered in the legend by the number of cases with fct_reorder()

Then we can look at the data as a table and make a plot with the number of cases in the y-axis and date in the x-axis.

# find the 10 most affected countries (to date)
latest_data <- covid_data %>% group_by(Country_Region) %>% arrange(desc(Date)) %>% slice(1) %>% ungroup() 
most_affected_countries <- latest_data %>% arrange(desc(Confirmed)) %>% slice(1:10) %>% .$Country_Region
# have a look at the data as a table
latest_data %>% arrange(desc(Confirmed)) %>% filter(Country_Region %in% most_affected_countries) 
## # A tibble: 10 x 5
##    Country_Region Date       Confirmed Deaths Recovered
##    <chr>          <date>         <dbl>  <dbl>     <dbl>
##  1 US             2021-07-18  34079960 609019         0
##  2 India          2021-07-18  31144229 414108  30308456
##  3 Brazil         2021-07-18  19376574 542214  17304814
##  4 France         2021-07-18   5929929 111662    408739
##  5 Russia         2021-07-18   5884593 145975   5277561
##  6 Turkey         2021-07-18   5522039  50488   5380752
##  7 United Kingdom 2021-07-18   5455276 128988     16430
##  8 Argentina      2021-07-18   4756378 101549   4394695
##  9 Colombia       2021-07-18   4639466 116307   4389277
## 10 Italy          2021-07-18   4287458 127867   4113478
# subset just the data from the 10 most affected countries and order them from the most affected to the least one
most_affected_data <- covid_data %>% 
  filter(Country_Region %in% most_affected_countries) %>% 
  mutate(Country=fct_reorder(factor(Country_Region), Confirmed, .desc = TRUE))

# create a line plot the data
ggplot(most_affected_data, aes(x=Date, y=Confirmed, colour=Country)) +
  geom_line(size=1) + scale_y_continuous(labels=comma) + 
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(color="Country") +
  theme_bw(16)

It’s a bit hard to figure out how the pandemic evolved because the numbers in US, Brazil and India are an order of magnitude larger than the rest (which are very close to each other). How can we make it more visible (and also extend the details of the Date scale)?

# create the plot
plot <- ggplot(most_affected_data, aes(x=Date, y=Confirmed, colour=Country)) +
  geom_line(size=0.6) + scale_y_log10(labels=comma) +
  scale_x_date(NULL,
    breaks = breaks_width("2 months"), 
    labels = label_date_short()) + 
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(color="Country") +
  theme_bw(16)
# show an interactive plot
ggplotly(plot)
## Warning: Transformation introduced infinite values in continuous y-axis

Why did we get a warning message? How can we solve it? What can we infer from the graph (exponential increase)?

What happens when we take the log of 0?? Can we remove those 0s with the `filter()` function?
We can see a very similar trend for most countries and while the curve has flattened substantially in April last year, the numbers are still rising. It is also evident that Europe got hit by a second wave arount October last year.

Baseline comparison

To be able to compare the disease progress between countries, we could “normalise” the data to show the epidemic spread using a similar baseline, for example, counting the days since the 100th confirmed case in each country.
Make sure to label the axes appropriately!!

include_tweet("https://twitter.com/OdedRechavi/status/1417163305085460487?s=20")
baselined_data <- most_affected_data %>% filter(Confirmed>=100000) %>% group_by(Country_Region) %>% 
  mutate(Days_since_100k=difftime(Date,dplyr::first(Date), units = "days")) %>% ungroup()

country_order <- baselined_data  %>% filter(Days_since_100k==300) %>% arrange(desc(Confirmed)) %>% 
  select(Country_Region)
# create the plot
plot <- ggplot(baselined_data, aes(x=Days_since_100k, y=Confirmed, colour=factor(Country, levels=country_order$Country_Region))) +
  geom_line(size=0.6) + scale_y_log10(labels=comma) +
    scale_color_paletteer_d("ggsci::springfield_simpsons") +
  labs(color="Country", x="Days since the 100,000th patient") +
  theme_bw(16)
# show an interactive plot
ggplotly(plot)

Analyse by Population size

We can also look at the change in the number of cases relative to the population size of each country. To do so we will read another dataset of weekly new cases collected by the European Centre for Disease Prevention and Control, which include the population of each country (as of 2019).
We can read it directly from their website as a .csv file (as below) or we can download the most recent daily COVID-19 data from the website (download the Excel file and place it in the data folder, then use read_excel() instead of read_csv()).
We need to wrangle this one up as well to make sure that the countries names are consistent (try to open the file in excel file and find the issue) and the dates are parsed correctly. We will then calculate the number of accumulated cases and deaths using mutate(Total_cases=cumsum(Cases), Total_Deaths=cumsum(Deaths) as well as the number of cases per million people in each country.

eu_data <- read_csv("https://opendata.ecdc.europa.eu/covid19/nationalcasedeath/csv", 
                    na = "") 
glimpse(eu_data)
## Rows: 32,092
## Columns: 10
## $ country          <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghani~
## $ country_code     <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG~
## $ continent        <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia~
## $ population       <dbl> 38928341, 38928341, 38928341, 38928341, 38928341, 389~
## $ indicator        <chr> "cases", "cases", "cases", "cases", "cases", "cases",~
## $ weekly_count     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 12, 18, 80, 185, 308, 3~
## $ year_week        <chr> "2020-01", "2020-02", "2020-03", "2020-04", "2020-05"~
## $ rate_14_day      <dbl> NA, 0.000000000, 0.000000000, 0.000000000, 0.00000000~
## $ cumulative_count <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 16, 34, 114, 299, 607, ~
## $ source           <chr> "Epidemic intelligence, national weekly data", "Epide~
covid_weekly_data <- eu_data %>% 
  rename(Country=country) %>% select(!any_of(c("weekly_count", "rate_14_day"))) %>% 
  mutate(Country=str_to_title(gsub("_", " ", Country)),
         Date = ISOweek2date(sub("-([0-9]{2})", "-W\\1-5", year_week))) %>% # 
  pivot_wider(names_from = "indicator", values_from = "cumulative_count") %>% 
  filter(!grepl("(Total)", Country, fixed = TRUE)) %>% 
  group_by(Country) %>% arrange(Country, Date) %>% # slice(1) %>% 
  mutate(Total_cases=cases, Total_Deaths=deaths,
         Cases_per_mill=Total_cases*1e6/population) %>% 
  ungroup()

# use the line below if using the Excel file downloaded from the web page  
# read_excel("data/COVID-19-geographic-disbtribution-worldwide-2020-03-22.xlsx") 
  
  #        iso3c=countrycode(GeoId, origin = 'iso2c', destination = 'iso3c')) %>% 
  # ungroup()

Similar to what we did earlier, we’ll group the data by Country and sort it by Date to select the most recent data that we’ll use to select the most affected countries.

# find the 10 most affected countries and order them from the most affected to the least one
worst_by_pop_data <- covid_weekly_data %>% group_by(Country) %>% 
  arrange(desc(Date)) %>% slice(1) %>% 
  ungroup() %>% filter(population>100000) %>% arrange(desc(Cases_per_mill)) %>%
  slice(1:10) 

# subset the weekly data 
plot_data <- covid_weekly_data %>% filter(Country %in% worst_by_pop_data$Country) %>% 
  mutate(Country=factor(Country, levels = worst_by_pop_data$Country)) %>% 
  filter(Date>as.Date("2020-04-01"))
# create the plot
ggplot(plot_data, aes(x=Date, y=Cases_per_mill, colour=Country)) +
  geom_line(size=0.6) + scale_y_continuous(labels=comma) +
  scale_x_date(NULL,
    breaks = breaks_width("2 months"), 
    labels = label_date_short()) + 
    scale_color_paletteer_d("rcartocolor::Bold") + # ggsci::springfield_simpsons
  labs(x="Date", y="Cases per million people") +
  theme_bw(16)

# create output folder
dir.create("./output", showWarnings = FALSE)
# save the plot to pdf file
ggsave("output/cases_per_million_worst_countries.pdf", width=8, height = 6)

Questions

  1. Why didn’t we use the vector of countries that we identified earlier from the original dataset?
  2. What other metrics we could analyse?
  3. What we should take into account that might bias the results or the true status of the pandemic?
1. Because the country names do not necessarily match (check how USA appears in both datasets)  
2. Mortalities (Case Fatality Rate), cases per population density (population/area)?  
3. Suggestions?

We can look at the average cases by continent and how the pandemic spread across the globe and the different responses of the continents.
Check at the following BBC site and check roughly when lockdowns and travel restrictions were applied and plot it.

cont_weekly_data <- covid_weekly_data %>% group_by(continent, Date) %>% 
  summarise_at(c("Total_cases", "Total_Deaths"), ~mean(., na.rm=TRUE)) %>% filter(Total_cases>1) %>% 
  ungroup()
# create the plot
cont_plot <- ggplot(cont_weekly_data, aes(x=Date, y=Total_cases, colour=continent)) +
  geom_line(size=0.6) + 
  geom_segment(aes(x = dmy("15/03/2020"), xend = dmy("15/03/2020"), y = 1, yend = 100000), colour="red", linetype ="dashed", size=0.25) + 
  scale_y_log10(labels=comma) +
    scale_color_paletteer_d("rcartocolor::Bold") +
  scale_x_date(NULL,
    breaks = scales::breaks_width("2 months"), 
    labels = scales::label_date_short() ) +
  labs(color="Continent", y="Mean confirmed cases") +
  theme_bw(16)
# show an interactive plot
ggplotly(cont_plot)

Maps

We can use the same data to visualise the impact of COVID-19 on a global/regional map to grasp its worldwide effect.

Choroplet Map by Country

We will create a color scale to represent the number of cases and appropriate popup information tags for each country.

# download map data
world_map <- download_map_data("custom/world-palestine-highres")
mapdata <- get_data_from_map(world_map)

# select only most current data
choroplet_data <- covid_weekly_data %>%  group_by(Country) %>% arrange(desc(Date)) %>% 
  slice(1) %>% filter(Total_cases>0) %>% 
  mutate(log_cases=log(Total_cases),
         CFR=scales::percent(Total_Deaths/Total_cases, accuracy=.01),
         Total_cases=scales::comma(Total_cases, accuracy=1), 
         Total_Deaths = scales::comma(Total_Deaths, accuracy=1), 
         Cases_per_mill=scales::number(Cases_per_mill, accuracy = .01,big.mark = ","), 
         pop=scales::comma(population, accuracy=1),
         geoID=countrycode(country_code, origin = 'iso3c', destination = 'iso2c', 
                           warn = FALSE),
         geoID=case_when(country_code=="XKX"~"KV",
                         geoID=="UK"~"GB",
                         geoID=="EL"~"GR",
                         geoID=="PS"~"WE",
                         geoID=="JPG11668"~"UM",
                         TRUE~geoID))


# # define colors

bins <- 10^(0:8)# c(1, 10, 100, 1000, 10000, 100000, 1000000)

cols <- as.character(paletteer_c("viridis::inferno", length(bins), direction = -1))
stops <- data.frame(q=1:length(bins)/length(bins), c=cols) %>% list_parse2(.)

# plot map
  highchart(type = "map", hc_opts = list(caption=glue("<b>Number of confirmed COVID-19 cases by Country</b><br/>An up-to-date summary of daily data obtained from the European Centre for Disease Prevention and Control<br/>Data was last updated on {format(max(choroplet_data$Date), '%d/%m/%Y')}."))) %>%
    hc_add_series_map(map = world_map, df = choroplet_data,
                      value = "log_cases", joinBy = c("iso-a2", "geoID")) %>%
    hc_colorAxis(stops  = color_stops(length(bins), cols), tickPositions=log(bins), showLastLabel=FALSE, labels=list(formatter=JS("function(){ 
    var n=Math.exp(this.value);
    if (n < 1e3) return n.toFixed(0);
    if (n >= 1e3 && n < 1e6) return +(n / 1e3).toFixed(1) + 'k';
    if (n >= 1e6 && n < 1e9) return +(n / 1e6).toFixed(1) + 'M';
    if (n >= 1e9 && n < 1e12) return +(n / 1e9).toFixed(1) + 'B';
    if (n >= 1e12) return +(n / 1e12).toFixed(1) + 'T';}"))) %>%
    hc_tooltip(useHTML=TRUE,headerFormat='',
               pointFormat = '{point.Country} confirmed cases : <span style="     border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: gold !important;" >{point.Total_cases}</span><br/>Deaths: <span style="     color: white !important;border-radius: 4px; padding-right: 4px; padding-left: 4px; background-color: orangered !important;" >{point.Total_Deaths} (CFR {point.CFR})</span><br/>Population (2019): <b>{point.pop}</b><br/>Cases per million: <b>{point.Cases_per_mill}</b>') %>%
    hc_mapNavigation(enabled = TRUE) %>%
  hc_title(text = "Number of confirmed COVID-19 cases by Country",
           margin = 40, align = "left",
           style = list(color = "#2b908f", useHTML = TRUE)) %>%
  hc_subtitle(text = glue('Data origin: <a href="https://www.ecdc.europa.eu/en/publications-data/data-national-14-day-notification-rate-covid-19">European Centre for Disease Prevention and Control</a> (updated on {format(max(choroplet_data$Date), "%d/%m/%Y")})'),
              align = "left",
              style = list(color = "#2b908f", fontWeight = "bold")) %>%
  hc_exporting(enabled = TRUE)

What else can we plot or investigate?

1. Mortalities (Case Fatality Rate), cases per population density (population/area)?  
2. Hospitalisation data  
3. Vaccination data  

Additional Resources

  • Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) data repository and website
  • EU 14-days COVID-19 data for download in CSV/Excel format link
  • Be awesome in ggplot2: A Practical Guide to be Highly Effective – R software and data visualization (link)
  • COVID-19 vaccination data in Our World in Data site
  • My very own COVID-19 dashboard (created in R)

Contact

Please contact me at i.bar@griffith.edu.au for any questions or comments.